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Abstract 

A conceptual framework for the analysis of dichotomous and ordinal polychotomous re¬ 
sponses within a penalized multivariate Generalized Linear Model is introduced. The pro¬ 
posed structure allows for a rather flexible predictor specification through the inclusion of 
non-parametric and spatial covariate effects, and the characterisation of the distribution of 
the stochastic model components with copulae of univariate marginals. Analytic deriva¬ 
tions for the particular case of Gaussian marginals within a bivariate system of dichotomous 
outcomes are also provided, and the framework is subsequently illustrated through the esti¬ 
mation of the HIV prevalence in Zambia using the 2007 DHS dataset. 

Key-words: Copulae; Generalized Additive Models; HIV Prevalence; Multivariate Discrete 
Data; Penalized Regression Splines. 


1 Introduction 


Generalized Linear Models (GLMs, Nelder and Wedderburn 1972) are a comprehensive class of 
models that allows us to conduct estimation and inference for a variety of response types within 
the same coherent unifying framework. However, despite their undoubted relevance in applied 
research, they rely on a purely parametric specification of the covariate effects on the response, 
which effectively constraints the linear predictors to be a determined fixed-order polynomial, 
for instance. This is a strong requirement, as one cannot typically expect to know in advance 
the actual form of covariate-response relationships. This is especially the case in observational 
studies where their “experimental” situations are not conducted in a controlled manner. An 
actual risk for the researcher, therefore, would be to incorrectly specify the functional form of 
covariate effects, hence to potentially generate a non-negligible source of bias whenever these 
are not adequately represented. 

An existing approach to overcome this limitation is to consider a more flexible class of models 
that permits the representation and estimation of the additive effects of some continuous covari¬ 
ates of interest in a data-driven way. Methods of this kind are usually termed semi-parametric 
in the statistical literature (although this denomination is generally not shared by econome¬ 
tricians) because they conjugate both a parametric and a non-parametric characterisation of 
the functional forms of the regressors. Specifically, whenever the baseline structure is that of a 


GLM, the so-called Generalized Additive Models emerge (Hastie and Tibshirani 1986, 1990), 


which complement their parametric counterparts by adopting a regression spline approach, im¬ 


plemented in a computationally stable and efficient manner by Wood (2006a). Nonetheless, as 


any traditional regression analysis, GAMs are effectively models for the mean of a random vari¬ 
able possessing a certain conditional distribution function. To enhance flexibility, therefore, it is 
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also licit to extend the framework to qualify the dependence of any moment of order higher than 
one on some explanatory variables of interest. In this way, the risks of mis-specifying the models 
and of conducting invalid inference from them is alleviated. This approach usually comes under 
the name of distributional regression, whose ideas have been variously incorporated within a 


GAM setting: for example, Rigby and Stasinopoulos (2005) proposed a Generalized Additive 


Model for Location, Scale and Shape (GAMLSS), whose framework has been recently extended 
to the multivariate case by Klein et al. (2015). A review of these and of some other existing 
methodologies is presented in Kneib (2013). This line of research then seeks to achieve a higher 
degree of flexibility by increasing the number of distributions allowed by the proposed model 
representations, and by including in their respective specifications various kinds of covariate 
effects. 

The present work aims at following these auspices in the context of discrete outcomes. Start¬ 
ing from the definition of a GAM for a J-variate vector of categorical responses as a penalized 
GLM, we discuss the conceptual representation of dichotomous and ordinal polychotomous de¬ 
pendent variables in terms of a triplet (r, Fj, Z), and of a penalty matrix S.\ that allows us to 
incorporate in the model various instances, like non-parametric, spatial and random covariate 
effects. A method for dealing with a mixture of those two types of responses is also outlined. 
We then show how a generic estimation algorithm can be derived and inference subsequently 
conducted within the resulting multivariate Generalized Additive Model, and we argue that 
such algorithm can be, mutatis mutandis, applied to any model representable in the ( r,Fj , Z) 
form. Although the pace of the discussion is intentionally kept at a quite generic level, con¬ 
nections between the proposed framework and some existing models are made. These have the 
dual scope of motivating our representation with well-developed examples from the literature 
and, at the same time, of offering a way to extend them to the more flexible predictor speci¬ 
fications that form the domain of our work. In particular, attention is given to nested models 
accounting for unmeasured residual confounding: an instance rather frequent in observational 
studies and that may lead to detrimental consequences on the parameter estimates whenever 
it is not adequately controlled for (e.g. Becher, 1992). The proposed representation is then 


used to define a sample selection model for dichotomous responses to credibly assess the human 
immunodeficiency virus (HIV) prevalence in Zambia. With this empirical illustration, we give 
evidence of the flexibility of our generic representation which also permits the inclusion of mul¬ 
tivariate distributions defined through copulae of univariate marginals, and the dependence of 
the corresponding association parameter to be expressed as a functional of the available data. In 
summary, therefore, this paper contributes to the literature by providing a flexible tool for the 
representation, estimation and inference in multivariate GLMs for discrete responses admitting 
non-parametric and spatial-dependent covariate effects, and by accounting for the unification 
of models for residual confounding under the same conceptual frame. 


2 A GAM Representation for Discrete Responses 

Let Y = {Y\, .. ., Yj) T be a random vector with support the discrete set 1C := /Ci x • • • x /Cj, 
where /Cj := {1,..., I\j} and #(/Cj) = Kj < oo for every j e J , J := {1,..., «/}; namely we 
consider each variable Yj to have finite Kj levels. The set /Cj is assumed here to collect both 
qualitative and quantitative elements, as well as variables measured on the nominal or ordinal 
scale (Stevens, 1946). Specifically, the former differentiates items based only on the categories 


they belong to, whereas the latter allows also for a rank order by which the realisations of Yj 
can be sorted, but still the relative degree of difference between them lacks of any meaningful 
interpretation. For notational convenience, we represent each kj by a natural number, /Cj C IN, 
with the convention that, wherever the support of Yj is ordinal, we postulate the existence of 
an isomorphism that maps bijectively each element of the qualitative ordinal set 1C* onto /Cj. 
In this case it holds: k* A k* in 1C* if and only if kj Y kj in /Cj, and we take the set (/Cj, R) to 
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be totally ordered. 

In analogy with the approach outlined in Peyhardi et al. (2014) for the univariate case, we 
consider a regression of the probability 7 r*, = P[Y" = k\X = x\, with k := (k ±,..., kj) T £ /C, on 
some covariates x := vec(xi,..., xj) defined through a Generalized Linear Model form 


7V = g \r)):=(r 1 o F )^,..., 77^-1), ( 1 ) 

where r : M —> V is a diffeomorphism from M := {(0, 1) a_1 |1 T 7t < 1} to an open subset V 
of ( 0 , and with 7 r := {TTk}k£K.\{K}- Model (|TJ) also comprises the map T : S —> M., with 

S C R^ _1 , and the array J~(r]) := (Fj(t 7 1 ), ..., Fj(r/ K _ 1 )) T is taken to collect fully-specified 
J-variate distribution functions, each of them evaluated at r] k := ( 771 ^,..., VJ,kjV = Z/3 fc , a 
linear predictor. Wherever needed, we assume that the elements of K, obey a lexicographical 
order, that is (k \,..., kj) A (k \,..., kj ) if and only if kj A kj for all j £ J or (kj = kj A kj A 
kj for some j E J). A more traditional GLM representation for the k -th category can be 
recovered from 0 and reads as 


r(vr fc ) = Fj{r ] k ) = Fj(Zf3 k ), 


( 2 ) 


where r is now a function specific for the type of the responses Y. For instance, in the univariate 
framework, dichotomous variables would set rj such that ir kj i—>• ir k ., the identity map, therefore 
([ 2 ]) reduces to any model for the binary outcome, say a logit or a probit, depending on the 
definition of F\. Models for ordinal polychotomous responses, as the Cumulative Link Model 
(CLM) of McCullagh (1980), also possess this representation and set the left-hand side of ([ 2 ]) 
as rj{Tr kj ) = 7Ti + • • • + 7Tfc.. Although more specifications of different univariate response types 
are illustrated in Peyhardi et al. (2014), in this work we confine ourselves to the sole study of 
dichotomous and ordinal outcomes, since they are the most frequent instances of the class of 
models we aim at developing. 

As ([ 2 ]) explicates, any GLM for discrete responses is fully characterised by the triplet 
( r,Fj,Z ), where the design matrix Z depends on the covariates x, though not necessarily 
coinciding with them. For example, let the polychotomous response Y follow the model 


r(*k) = E ■ ■ ■ E - X/3) = Fj(Z0 k ), (3) 

ki<ki kj<kj 

then Z := diag(z7, • • •, zj) and (3 k := vec {(3 lk ,..., f3 J%k ), where z j := (1, -x i;mj ) T 

and /3j k := . /3y 1 ,..., Pj >mj ) T ■ hi the proceeding analysis, we call the Cj^.’s threshold 

parameters or cut points, and we assume they are the only elements in the corresponding linear 
predictor rjj :kj to depend on the categories of Yj. We also stress that only Kj — 1 cut points 
are effectively estimable in this framework because, in order to allow the domain of Fj to 
coincide with the extended real hyper-plane IR" 7 , we need to impose Cj t Kj = 00 for any j and 
Cjfl '■= Cj, 1-1 = — 00 . As a consequence, a dichotomous response with support ICj := {0,1} 
would set the only threshold to 0, and the model intercept is now estimable. 


Ordinal Polychotomous Outcomes This instance is of some interest in terms of the pro¬ 
posed GLM specification and worth to be discussed further. Notice first that the given definition 
of r(iT k ) is posing a constraint to the set V. Specifically, by assuming kj A kj, we have 


r(ir k ) = E ■■■ E 


7r 


ki,...,kj 


k±<ki kj<kj 

E ■■■ E ■■■ E 7 T ki,...,kj + E ••• E E 

k\<k\ kj<kj kj<kj k\<k\ kj$z[kj ,kj\ kj<kj 

r (n) + r'(’Tfc) > r ( 7r fc) 
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since r'(iT k ) is the sum of probability measures. We also deduce k := (k \,..., kj ,..., kj -\) Y 
(fei,..., kj ,..., kj) =: k by the assumed lexicographical order. Hence r( Tr k ) < r(TTk) for any 
k A k, and V := {r G (0, l) A ~ 1 |r(7r^.) < r(iTk), for all A; A k and k, k 6 /C}. 

If this restriction comes from the very construction of a CLM, a second one emerges to 
let the model meet a general coherency condition. To establish this result, the inspection of 
©> reveals that the linear predictors {rj k } depend on the element k of the discrete ordered set 
K that one attempts to model. The sought coherency requires, therefore, the definition of a 
specific correspondence between the order relations existing in k 6 K with those in r\ k 6 5, 
This is identified, in particular, in the order embedding of each Kj into a relevant subset of 
the real line as induced by the thresholds {cjj Z;j }: in this way, it is possible to construct non¬ 
overlapping hyper-rectangles in isomorphic to (k\ ,..., kj) € K. In terms of a multivariate 
CLM, the order embedding is guaranteed by taking the cut points {cj ikj } to be an increasing 
sequence in kj for every j £ J wherever, as stated above, the threshold parameters are the only 
quantities in the linear predictors depending on the categories of Y r Consequently, it follows 
the isomorphism {r] k } = {k}, meaning that there exists a bijection <p : K — > S such that 
<p(k) := rj k Y rj k =: <p(k) in S if and only if k Y k in K. In this case, the domain of J- is 
then restricted to be the set S := {rj E 'K K ~ 1 \r] k Y r] k , for all k Y k and k,k E K}. In the 
bivariate case, J = 2, Dale (1986) imposed a strict monotonicity on the cut points to imply 
a non-degenerate probability measure on K. Although this condition would in turn debar one 
possible source of the Maximum Likelihood Estimator to be located at the boundary of the 


parameter space with non-null probability (refer to Haberman, 1980 for the univariate case), 
we reckon this restriction has to be avoided as it arbitrarily excludes a still admissible estimate, 
albeit at the boundary. Arguably, two congruent subsequent cut points are not in contrast with 
the coherency principle. In fact, given any two elements k and k of K such that either kj = kj 
or kj -< kj , the coherency implies that the occurrence Cj k . = Cj :kj is verified if and only if 
kj = kj , that is whenever K := K\ x • • • x (Kj \ {kj}) x • • • x Kj, unless kj is an element of zero 
probability mass in K. Either cases correspond to observing zero counts for the kj- th category, 
but with the coherency would be still in place. 

Under this principle, it is possible to motivate ordinal polychotomous responses through a 
generating continuous latent random vector Y* := (Y )*,... ,YJ) T in in such a way that, 
upon letting e be the stochastic component in the regression Y* = ~K(3 + e, it holds 


{Y A k} «=► {e < rj k }, 

where the right-hand side is intended component-wise as ej < rjj :kj for every j. 


(4) 


2.1 Specification of the Linear Predictors in a Penalized GLM 

Linear predictors enter representations (jTJ) or ([2]) as domain of the distribution functions col¬ 
lected in J 7 , and are fully characterised by the design matrix Z for any vector of parameters 
(3 k . In the proceeding, we assume that each predictor r/j tkj depends parametrically on some 
covariates Xj. and through an additive form of unknown smooth functions Sjj^ : It —> It for 
the remaining continuous regressors Vj := (vjj,.... Vj ) ^ j ) t . The resulting functional form is 
then termed semi-parametric, and defines the class of (Vector) Generalized Additive Models 


(VGAMs; Yee and Wild, 1996). We prefer, however, to adopt the alternative terminology of 
penalized GLMs as, in our opinion, it reflects better the features of the class of models we 
discuss beyond the traditional domain of GAMs. 

For dichotomous and ordinal polychotomous responses we define the linear predictor to be 


v j,0) = c j}kj ~ xj (3j - Sj,i(vyi)--VMV./J v j,h e R - 


(5) 


where denotes the parameter vector, and the smooth functions are represented by regression 


splines using the approach popularised in the literature by Eilers and Marx (1996). Assume 
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first that we have a sample of n observations indexed by the subscript i. The underlying 
idea of the method is to approximate each curve by a linear combination of known basis 
spline functions, b h i . for hj = 1 and unknown regression parameters to be esti¬ 

mated within the system, Sjj. ^. In our notation, hj is employed to count the bases, as 
delimited by some knot points in the interior of [vj^.m, Vj,i-,(n)] f° r every j. Upon defining 
hj.ijivj.ij.i) '■= , and 8jj j the corresponding Hj- dimensional 

vector of parameters associated with the smooths, it holds 


The evaluation of i^y. for each i yields Hj curves - encompassing different degrees of complex¬ 
ity - that give, once multiplied by some real-valued parameter vector and then summed, an 
estimated curve for Sjj.. Basis functions are usually chosen to have convenient mathematical 
properties and good numerical stability: possible instances are B-splines, cubic regression and 
low-rank thin plate regression splines (e.g. Ruppert et al.[ 2003 and Wood, 2003). For identifia- 
bility purposes, a centering constraint such as s j,lj ( z j,lj,i ) = 0 for every lj has to be imposed, 
which is automatically incorporated in our model representation using the parsimonious ap¬ 
proach of Wood (2006a). 

We are now in the position to express the functional form of linear predictors <§ through 


a more compact and comprehensive representation. To this end, let us define P\ 3 j ] \ 


:= Si 


•3 J \hh\ 


the sub-vector of (3 corresponding to the th smooth and, accordingly, X^y j the covariate 


matrix whose i-th row is bjj . (vj^.y). It then follows that the n-dimensional vector of linear 
predictors for the j-th response can be written as: 


TJj . Cj Xj ; l/3jy ••• X j ) Mj f3j' Mj ^jPji (6) 

where Zj := (l n , -Xjy, ..., Xj ;m; ., • • ■ , X J)Mj ) and (3j := vec(c^ k;i , f3 ]{ , .. We further 

assume j = 1,..., J, with J > J to possibly specify any association parameter implied by 
the distribution Fj in terms of some observed independent variables. So re-written, the linear 
predictors conform notationally with the GLM given in ([3]), with the caveat that now they 
can be indifferently used to represent linear and non-linear covariate effects within a generic 
GLM for multivariate discrete data. To see this, set rj := (rj 1 , ..., rjj) to be the array whose 
i- th row is r) k = ..., ..., rjj) T and, accordingly, (3 k := vec ((3 1 ,..., (3j). Thus 

Z i := diag(z 1 r i ,... ,z~j .), with zj- being the i-th row of Zj, and the linear form that defines the 
right-hand side of ([3]) is now recovered as rj k := Z*/3 fc . 


Characterisation and Definition of a GAM as a Penalized GLM To enforce certain 
properties of the covariate effects, a ridge-type penalisation is assigned to each column of Zj, 
namely Vj, mj '■= Sj.m ,■ where the dimension of S j :rnj is generically denoted by 

q. The smoothing (or tuning) parameter \j, mj £ [0, oo), in particular, is introduced here to 
control the trade-off between smoothness and fitting in the non-parametric estimation of Sj )mj . 
Specifically, as A j tmj tends to zero, less penalisation is attached to the regression coefficient 
f3j m . and the estimation occurs either at the pre-specified polynomial form for the parametric 
model components, or at the spline interpolation for the unknown functions. On the other 
hand, an infinite value of Aj imj . results in the fitting of a straight line, a situation also known as 
over-smoothing. 

The penalty can be used to describe several covariate effects in the same unifying 

manner. In particular, a parametric functional form would set S j tJn - = 0 gig , and the corre¬ 
sponding X;.„, reduces to Xj jTnj ^j )mj , with Xj, mj . := (xj^.y,... ■x J: m :j , n ) T whereas, in the 

presence of non-parametric effects, one can specify the penalty through the symmetric and 
positive semi-definite matrix 
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a generic measure of the curvature of the estimated (j,m,j )-th smooth function (see Green 


and Silverman, 1994 for a detailed introduction to this roughness penalty approach to curve 
estimation). 

Furthermore, Vj, mj is compatible with the specification of random effects models (S= 


Iq), as well as with the definition of spatial covariate dependence. As recently outlined by Klein 


et al.| (2015) and Marra et al. (2015), this approach can be employed wherever the phenomenon 


of interest to possesses characteristics that vary according to the geographical location of each 
individual. Let G 1Z, 1Z := {1,..., ii}, be the region to which the i-th observation belongs, and 
define R the corresponding nx R design matrix of the spatial effects. This sets ,% = 

x ( . so that we estimate separate parameters /3 l ,..., (3 R for each region, and R is an incidence 
matrix, namely an array such that Rlj = 1 if observation i belongs to r 6 1Z, and 0 otherwise. 
Then, for discrete spatial effects, a Markov random field (e.g. 
penalty 

-1 r ^ s A s <E 5 r 
0 r / s A s ^ 6 r 


Rue and Held, 2005) induces the 




— Slr,s] 


N r r = s 


with 5 r denoting the set of neighbours of region r, and N r the number of regions in 5 r . 

Once every component in Z j is endowed with a proper penalisation depending on the desired 
effect one is willing to model, and S j tTn adequately adjusted to meet the splines’ centering 
constraint, an overall penalty for the whole model can be constructed as V\ := where 

corresponds to padded with zeros so that = /3 T S\/3, := diag{Sj >rrij . 

and A := } r n :n j- A penalized GLM is therefore defined as any model in the form of S 

augmented with a non-zero penalty V\. 

In the next section, we qualify the generic framework to describe a class of models widely 
used in applications, and we show how it can be represented within the (r, Fj, Z) frame. In 
this way, these models can be extended beyond the parametric specification of their functional 
form of covariate effects, and their estimation and inference will then be a direct consequence 
of those of a generic multivariate penalized GLM for discrete responses. 


2.2 Some Bivariate Models in the Class of Penalized GLMs 

The analysis of observational data may be difficult as they often depart from the ideal conditions 
underlying any (also rather simple) regression model. They are commonly characterised by a 
lack of randomisation that may result either in a non-random selection of individuals in the 
sample, or even in the non-random allocation of a predictor of interest among the population 
(hence inducing a distorted association with the outcome). The former is commonly referred to 
non-random sample selection, and arises whenever individuals select themselves in or out of the 
relevant sample. It is often the case that some factors that determine the membership to the 
selected sample are also associated with those that determine the outcome itself. In the empirical 
illustration accompanying this work, and concerning the estimation of the HIV prevalence in 
Zambia, the refusal of people to be tested for the virus may be induced by factors associated to 
their HIV status. For example, they may already know or correctly predict their seropositivity 
and so fear others will learn about it if tested. The latter instance is regarded instead as a 
form of endogeneity as it is denominated in the econometric literature, and it may stem from 
including, but not limited to, the direct unmeasured confounding problem; 
) discusses this in detail as well as other generating sources of endogeneity, and 
we refer to him for a more thoughtful illustration of the topic. This situation arises whenever 
a common background variable affects simultaneously both the outcome of interest and one of 
its regressors, and it is not readily observable or quantifiable by the researcher. The affected 
covariate is then termed endogenous, and its effect on the outcome results confounded. A 
pedagogical example is the estimation of the effect of education on wages. Both the relevant 


different sources, 


Wooldridge (2002 
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variables in this study can be co-determined by factors such as personal ability and motivation 
that are likely to be explainable to both individual’s level of education and salary, but hardly 
measurable (see for example Imbens, 2014 for an interesting survey on the topic). 

When not accounted for, non-random sample selection and endogeneity can both lead to 
inconsistent parameter estimates for the whole model. To deal with these issues, in some early 
works, Heckman (1978, 1979) devised a two-step estimation procedure for a prototypical recur¬ 
sive bivariate system of equations in a dichotomous responses setting, with Y = (Yf,!^)" 1 " and 
Y 2 (Y l ). His proposals specified a binary rule for the observability of the outcome of interest, Y 2 , 
for the non-random sample selection case, and related the conditional mean of the endogenous 
regressor, Y \, to various other predictors when endogeneity is suspected. In either scenario, 
identification of the true association between the elements of Y would require to be able to 
qualify the dependence of Y\ on a relevant variable which is assumed to be independent of both 
Y 2 \Yi and the unmeasured confounder(s). 


Unmeasured Confounding We consider the case where both the responses are discrete, 
and we specify the following triangular generating structure for the j -th categorical response in 


terms of a latent variable formulation, as of Case 3 in Heckman (1978), 


Y* = l j=2 {^Y*} + ^J, 1 P j , 1 + 


' ' + X J,Mj 


P 




+ 


(ei,e 2 ) T ~ A7 2 (0,fi;p), 


(7) 


where xj m denotes the (j,rrij )-th row of X jmj • The given distributional assumption is in 


line with the considerations of Greene and Hensher (2010) and with the current practice in 


multivariate discrete response modelling. In particular, we set J2 := Var[e] to have unit main 
diagonal elements for identifiability purposes and correlation coefficient p. Let us next define 
the following quantities: 


r : = 


1 0 

— %b 1 


and L:= 


1 0 
0 yr+%T?J ’ 


where L is a lower triangular, positive definite matrix since (1+2i/j p+i/j 2 ) = (1 -p 2 )+(p-\-ip) 2 > 0, 
and with L Y* = Lr _1 (X/3 + e) distributed as Standard Normal random vector with covariance 
matrix X = Lr _1 fir _1 L T . This structure constitutes the most general one we discuss in this 
paper, as it nests the vast majority of the other model specifications for unmeasured confounding 
and non-random sample selection currently proposed in the literature. Notice that the manifest 
polychotomous ordinal responses can now be recovered from Q via the series of equivalences 

{Y Y k} <*=>■ {TY* < c fc } {LT^e < LT _1 Z/3}, 

which implies that the predictor L T~ 1 r] k is non-linear in $, an occurrence that has to be ac¬ 
counted for in the derivation of the estimation algorithm. The corresponding (r, F 2 , Z) definition 
of the triangular form 0 for every k £ K. is then 


I 7r fci,fc 2 ’ $ 2(hi,fc 1 ,h2,fc 2 ;^) I Lr 1 Z i \ , (8) 

\ki<ki k2<k2 ) 

where Z is defined as in ([b]) and, in the case of /C := {0,1} X {0,1}, the first entry of the 
triplet becomes r^n^) = while the design matrix has to be modified accordingly in order 

to accommodate the instance cyo = 0 for any j. This representation of the recursive structure 
allows explicitly for a latent endogenous predictor to be a determinant of the intentions about 
Y 2 . That is, if one interprets the intentions towards a manifest discrete outcome (the actual 
action) as the result of an underlying choice mechanism as described by Y*. then (|8| really 
assumes the existence of unobservables that influence simultaneously the intentions aDout the 
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components of Y. For example, there is a vast economic literature pointing out that the choice 
of investing in both health and education are confounded by individual time preferences, in the 
sense that people with low (high) rates of time preference are more (less) likely to decide to 
invest in both schooling and health (Sander, 1995, Fuchs, 1982, van der Pol,|2011 ). In this 


case, 

endogeneity is regarded to act at the level of the choice of how much to invest in education and 
future health status. A researcher could nonetheless be interested in modelling the effects of an 
observed endogenous variable (where the intentions have been revealed by the actual choices 
undertaken) on the discrete response Y 2 : this instance then specifies T = L = I 2 and Z is 
assumed to include the levels of the manifest Y\. A discussion about the distinctive features of 
these different modelling strategies can be found in Vossmeyer (2014), who also introduced a 
formal Bayesian model comparison framework to test these two competing models against the 
observed data. 

Models for a mixture of dichotomous and ordinal polychotomous responses can also be 
reconciled within this representation by giving a proper definition of r(ir k ). In fact, this is the 
only element which is directly affected by the types of responses considered, whereas the design 
matrix mainly attains the functional form assumed for the predictors, and F 2 specifies the link 
function. To describe this situation, we may think of r as the composition function r = r 1 o r2, 
where the subscripts correspond to the elements of the 2-dimensional vector Y they refer to. 
Therefore, by setting 


r-j(TTfc) =7r fe ., % and rj{n k ) = ^ n k , - k _, 

%<% 

we have rj(rj(ir k )) = for any k,k G K, and J := J \ {j}. The result follows because, 

for dichotomous responses, rj is the identity map, meaning that it is also indifferent the order 
in which the types of the outcomes appear in Y in terms of the model representation. In 
other words, irrespectively from whether Y\ or Y 2 is the dichotomous variable, the function 
composition (rj o rj) is commutative: 


(rjorj)(7Tfc) = rj(n k ) = rj(rj(ir k )) = (rj o rj)(ir k ). 


Non-random Sample Selection In this case, it is assumed that the outcome I 2 G /C 2 is 
observed if and only if {0,1} 9 Y\ = 1, whereas it is labelled as missing otherwise. As a 
consequence, the vector 7 r results further constrained: every element in the form of ttq^ 2 is 
now not a sensible quantity in the model for any k 2 , since it refers to a missing value in the 
realisation of Y?. Hence, one can only describe the corresponding marginal probability ttq., which 
is translated mathematically into the map A4 — > Ai s defined by ttom ^ Yl k2 eK c 2 fc 2 =: 7r °'’ 
and 7ri ; fc 2 *->• for any /C 2 G K .2 \ {A' 2 }, where M. s := {7r s G (0, 1)^ 2 |1 T 7 t s < 1}. Notice that, 
in complete analogy with the general case, if 7r s is augmented with 7Ti } k 2 i the components of 
the resulting vector will sum up to the unity. Hence, the (r, F 2 , Z) representation of this generic 
sample selection model would require just to exploit the corresponding function r as depending 
on the type of the response Y 2 . In particular, for a dichotomous response Y 2 , 

r(7T s ) = (7r 0 .,7ri )0 ) T = ($ 2 (-r?i, 00 ; 12), <f> 2 (hi, ~V 2 ] ^)) T = ( 9 ) 


where Q stems from S upon imposing the restrictions T = L = I 2 , and <J> 2 (~hi, °o) = 
< ^ > 2 (—??i, —^ 72 ) + (H> 2 (—??i, ^ 72 ) — < f > 2 (— 771 ’ —^ 2 )) = <h(-r/i) as from the map characterising the 
sample selection problem. This bivariate probit model was originally proposed by Heckman 


(1979) and subsequently extended to encompass penalized regression splines by Marra and 
Radice (2013). As a natural generalisation, one can also consider the support of /C 2 to be 
totally ordered, #(£ 2 ) > 2 like in Miranda and Rabe-Hesketh (2006), whose corresponding 
representation within our framework comprises 

r( 7T) = (tT 0 . , TTgl, • • • ,771,1 + 


+ TT lt K 2 -i) r , and 


8 































F(v) = 1 (^ 2 (r?i,i,r? 2 ,i;fi)),...,r 1 ($ 2 ( 771 , 1 , m,K 2 -V, n ))) T , 

where the generic r~ 1 (F 2 (r]i ! k 1 , ??2,fc 2 )) can be computed as the non-negative volume of the 
rectangles [r/i )fel _i, r?i,fcj X [%,fc 2 -i, 02,fc 2 ] in 


3 Estimation 


Let the conditional distribution of (Y"|JC = x) obey a Categorical distribution C(tv(x)) with 
mass function 

fY\x(y\x) = TTk(x) ly = k , (10) 

k£K. 

where t y=k is a Boolean function that takes value 1 if (yi = k\ A • ■ ■ A yj = kj) and 0 otherwise. 
Then, after having re-defined the response vector y = (l y =i,..., \ y =k) T , the distribution (llOj) 
can be written as 

fv\x(y\x) = exp { y T e - 6(0)} , 

where 


Ok = In 


1 - Efc 


, Ok = 0 and 6(0) = In ^ 1 + exp{0fc}^ 


which shows that C(rv(x)) can be expressed in the exponential form, and hence all the standard 
properties implied by this family of distributions follow. If we further take {(**, 2/j)}f = i a 
sample from Fy and Fx, where the y^s are assumed conditionally independent given the 
regressors, then equation (10) can also be used to derive the log-likelihood function of any 
multivariate model for discrete data admitting a (r,Fj, Z) form. Specifically, by denoting £j(i?) 
the contribution of the i-th observation to the log-likelihood, the iterative application of the 
chain rule results in 


V &ti = 


drjk f dF k dn k d0 k dtj 
d'd V dr) k dr k dir k d0 k 


D7 U; 


and V. 




= D, T W i D i + K i 


( 11 ) 


where 


W; = 


dVkdvJ dr k dn k dO k dy k dr k \drj k dr k J [ d0 k \dn k J d0\ 


2 02/ 


d 2 F k dn k d0 k dtj | dF k dn k ( dF k dn k ^ J d 2 0 k dtj | ^ d0 k ^ d 2 tj 


+ 


and 


K, = 


d 2 t) k dT k dn k d0 k dti 


d'dd'd T dr] k dr k dn k d0 k ' 

These expressions are analogous to those derived by Green (1984) in the context of iterative 
re-weighted least squares (IRLS) estimation of likelihood functions. Indeed, the baseline model 
is rather similar, with the sole relevant difference being the acknowledgment that only in some 
special cases r( 7r) = tv. In particular, wherever r is the identity map, u = vec(ui,..., u n ) 
reduces to the same simplified expression, d^/dr) k , that appears in Green (1984). Factor 
Kj is somehow unusual, and generally it is not reported in the relevant literature on GLMs. 
In fact, it is structurally equal to Q P)P wherever each rj k is linear in the parameter vector; 
however, this may not be true in some instances as shown, for example, in the triangular 
systems of equations having representation ([8]). The Information Matrix can also be derived: 
recalling that, for the exponential family of distributions, ddi/dQ k = y k — n k , b'(0 k ) = Tr k and 
d6 k /div k = [6"(0fc)] _1 = Var^fc]” 1 = [7r fc (l - 7r fe )] -1 , we have E[Kj] = 0 PiP , while 


dFkdrvk / dFkd 7rA T 1 
dr) k dr k \dy k dr k ) vr fc (l - Tv k ) 
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so that the Fisher Information component corresponding to the i-th observation is given by 

Xi = -EfV^T^] = I); W !),. 

Each individual matrix is finally aggregated into appropriate arrays to get a global representa¬ 
tion of score and Hessian as follows: D := (Dj| ■ ■ ■ |D,[|Ip) T , u := vec(ui,..., u n ,0 p ), — W := 
diag(Wi,..., W n , K), with K := K u so that V^(R) = D T u, and V m t£(0) = -D T WD. 


3.1 Penalized Likelihood 


The quantities derived above have been obtained only by the knowledge of the (r, Fj, Z) repre¬ 
sentation of the model that, alongside with the panalty matrix S A , embodies all the information 
needed to achieve estimation. Recall that any covariate effect other than a purely parametric 
specification requires the exploitation of certain features as included in the penalisation term 
V\. To account for them, a Penalized Likelihood (PL) is usually set up for estimation, and the 
corresponding MPLE is then defined as solution of the following optimisation problem 


R := arg max f p (R, A) 

Pee 


n i 

5>(T7 fe (tf)) - 2 tf T S A tf 


( 12 ) 


which is obtained from any fixed value of the smoothing paramter vector A. Because the 
quadratic form V\ is positive semi-definite by construction, the joint estimation of (R, A) would 
clearly lead to over-fitting since an optimal value for f p (R) would be reached at a state where 
A = 0. Our estimation strategy comprises therefore two alternating steps based on the outer 

(1986). Specifically, an estimate (R|A = 


O’Sullivan et al. 


iteration scheme originally proposed by 
A 7 ) is first obtained from any value A 7 via the maximisation of -£ P (R|A = A 7 ), which is then 
used to update a value of the tuning parameter vector. The whole procedure is iterated until 
convergence. 

Although a solution to problem (12) can in principle be obtained through any numerical 
optimisation algorithm, our subsequent analysis requires some of its iterations to be either 
of Newton-Raphson or of Fisher scoring-type to match with the derivation of the smoothing 
parameters vector. 


P-IRLS Scheme for Estimation Rather than handling the log-likelihood maximisation 
directly, it is convenient to define a penalized iteratively re-weighted least squares (P-IRLS) 
scheme based on quantities Let us first derive the Taylor series approximation of the 

function V#£ Pt i about the vector R = R — Rq, 

~ Vtf/p,* + ^ P) j(R — $ 0 ) = 0 P) 

where the last equality holds from R being the MPLE, and with Vg^ Pi j standing for V#-£ P) iL = ;g. 
Under the assumptions that D has full rank p, and W is positive definite throughout the 
parameter space 0, a Newton-Raphson algorithm comprises the non-singular p x p system of 
equations for R 

(V W T^(#1) - S A )tf[“ +1 l = (V^T^tfW) - S A )RM + (S a rM - Vtf*(RM)) 

(D t WD + S A )R* = DWz, (13) 

where z := DR + W _1 u defines the pseudo-data vector associated with any (r, Fj. Z) model. 
Moreover, equation ( |l3| ) is expressed in terms of R* := R[“ +1 ] ; while the dependence of all the 
other variables on the [ccj-th iteration is neglected to avoid clutter in the notation. Finally, 
by noticing that the above system can be recovered directly from the normal equations of a 
Generalized Least Squares (GLS) regression of z onto the columns of D, using a weight matrix 
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W and a ridge-type penalisation, it follows that ( fl3| ) corresponds to the closed-form solution of 
the problem 

D* = argmin ||v / W(z — Dt)|| 2 + tS A t. 
te© 

In other words, at every [a]-th iteration, the GLS recursion produces a closed-form expression to 
update the optimisation algorithm, and this is repeated until convergence. Apart from giving 
an elegant solution to the log-likelihood maximisation problem, the P-IRLS algorithm also 
establishes a correspondence between MPLE and GLS, and this provides us with an equivalent 
expression smoothing parameter selection can be based on. 

Remark 1. The use of matrix D in the computations above reflects the possibility of dealing 
with models involving non-linear predictors. In other simpler instances, this quantity reduces to 
the design matrix Z, with potential gains in the computational time of the P-IRLS procedure. 
In fact, D would usually depend on some functions of the parameter vector which need to be 
updated at every iteration; whereas, in the case of D = Z. this quantity can be stored outside 
the iteration loop. 


3.2 Smoothing Parameter Selection 


The correct specification of the “right” amount of smoothness is important for any practical 
modelling in non-parametric regression. In what follows, we adapt the Un-biased Risk Estimator 


(UBRE; e.g. Wood, 2006a) to the present context, so that smoothness selection is achieved from 


quantities that are directly stemming from the (r, Xj,Z) representation of the model; a stable 


and efficient computational method to implement this criterion is discussed in Wood (2004). 


In principle, vector A should be estimated in such a way that the fitted curves are as close as 
possible to the true unknown functions. To this end, let us consider the large sample approxi¬ 
mation ~ X implied by the likelihood model and, under the regularity conditions 

listed in Section S.l, it follows D T WD D T WD, where W := diag(Wi,..., W n , 0 PjP ). 
Since E[Kj] = 0 PiP , it also holds that E[K] = 0 PiP from the linearity of the expectation, hence 
K = o p ( 1) and W = W + o p ( 1) as n —> oo. Further let 


P = 


V / WD(D T WD + S a ) _1 D t Vw 


denote the influence matrix of the associated GLS model, namely the array such that the 
predicted values of the response V\Vz can be written as p, := VWDil* = P\/Wz, and z := 
z(W) be the pseudo-data vector evaluated at the asymptotic weight matrix. Then, by letting 
/x := E[\/Wz] = ’/WDtl be the expected value of the GLS response, we define A as the 
minimiser of the expected Mean Squared Error (MSE) of /L Namely 


n -1 E||/i — /2|| 2 = n -1 E|| VWz — P\/Wz — 


= n 1 E 


W(z — DiT)|| 2 + ||e|| 2 - 2{VWz-P\/Wz;e) 


-1/2 

where the stochastic term above is given by W u =: e ~ (0^,1^) since E[u] = 0 P and 
Var[u] = W, while the expectation of the inner product results in 


— 2n X E 


Wz — PVWz; i 


= —2 n X E 


e T (I K -P)(VWDtf+ e) 


= — 2 + 2n i E[e T Pe] = —2 + 2n i tr(P). 


Then, the corresponding UBRE criterion for the [a + l]-th iteration step reads as 
A [a+1] := argmin V U (A) := || VW [ “ +l1 (z [a+1] - D [Q+11 tf [ “ +11 )|| 2 /n - 1 + 2ntx(V [a+1] )/h, (14) 
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Algorithm 1 Pseudo-code for MPLE Computation under a Outer Iteration Scheme 
Require: a £ (0, iter .max); k > 1; (r, Tj,Z); S,\ 

0[°1, AM 

while a < iter.max V max — $[ Q 1| > 10~ 6 do 

#[«+!] 


max,# 




D 


W 


[q+1] 


Qvk 

(M 


#=#[“+!] ’ * 


[q+1] 


1 dJ~ /, div k 
ir k dr] k dr k 


#=#[«+!] 


Q+1] 


1 d 2 T k _ [q+1] [q+1]T 

TTfe dr] k dr * 1 


• K^ a+1 ^ 
#=#[“ + !) ’ 1 


d 2 V k „[«+!] 


dtfdtf T U * 


#=#[“+!] 


compute D[“ +1 ], U [“ + 1 l, K[ q+ 1 1 , W[ q+ 1 1 , z [“ +1 1 and p[ Q+1 l 


A [Q+i] ^ millA 

end while 


| v ^ [ Q+ 1] ( z [q+i] _ d [q+i]^[q+i])|| 2 /^_ 1 + 2 Ktr ( P [Q+i])/n 


where n is a given multiple of the sample size as determined by the dimension of r] k , and 
accounts for the multivariate nature of the framework. For instance, if we consider a bivariate 
model where both the responses are ordinal polychotomous with just one association paramter 
7 , it follows that, for every individual i, the corresponding array 


Vk (%i—l,fc2 — 1) Vki—l,k2 •> Vki,k2 — li Vki,k2 ■> TO 


is 5-dimensional and n = 5n + p. An additional inflation parameter k has been included in the 
UBRE criterion and it can be increased from its usual value of 1 in order to obtain smoother 
models. In effect, based on experimental results, Kim and Gu (2004) suggested to locate 


k, £ [1.2,1.4] to correct the tendency of prediction error criteria to over-fit the estimated curves. 
Notice that the trace of the influence matrix P represents the effective degrees of freedom of 
the model; in a penalised framework, they usually differ from the number of parametric model 
components since the presence of the penalisation in the fitting algorithm can suppress some 


dimensions of the parameter space. As a final remark, expression (14) can also be interpreted 
in term of the log-likelihood AIC: 

Proposition 1 . Let £($) be the log-likelihood, function of a model admitting a ridge-type pe¬ 
nalized GLM form ( r,Fj,Z ) and penalty matrix S^, then the UBRE expression (If), V U (A), is 
proportional to the Akaike Information Criterion (AIC) with the parameter space dimensionality 
corrected for the presence of S^ ; namely 

V U (A) oc 2tr(P) -2 £{&*). 

Proof. See Supplementary Material |S.3[ ■ 


The structure of the resulting fitting procedure is detailed in Algorithm [lj and an illustration 
of its ability to recover the unknown smooth functions shown in Figure [l] by a simulation study 
for the triangular model ([ 8 ]) . Some considerations on the asymptotic behaviour of the proposed 
estimator as well as a method to compute confidence intervals for the included smooths are 


detailed in Supplementary Materials S.l and S.2, respectively. 


4 Real Data Illustration: HIV Prevalence in Zambia 

We illustrate now the proposed framework via the estimation of a sample selection instance. In 
doing this, we specialise our structure to describe a bivariate probit regression with association 


parameter explained through an additive linear predictor (Radice et ah, 2015). This feature is 


attractive in the context of unmeasured confounding as it allows to account for various degrees 


12 




























Figure 1: Monte Carlo simulation based on 100 replications for the non-parametric components 
of the bivariate triangular ordered probit model described by form ([8]): the first two graphs refer 
to the smooths included in the first equation, while the last corresponds to the model for Y 2 . The 
true functions are depicted in red and an average estimated curve is represented in gray (notice 
that for many observations it overlaps with the corresponding truth value). The computations 
were performed by the R function SemiParCLM that implements ([8]) and some nested models 
following Algorithm [l] on a model comprising 10,000 observation and correlation coefficient 
p = 0.9. Details on the Data Generating Process are given in Supplementary Material 5~4 


of non-random sample selection across observations, and it helps to explain how the association 
between the relevant outcomes is affected by common unobservables for different individuals 
and covariates. 

The resulting model is then applied to data from the 2007 Zambia Demographic Health 
Survey (DHS) to flexibly estimate the prevalence of HIV in the Zambian male population. 
Our analysis complements the study of McGovern et al. (2015) through the inclusion of non- 
parametric covariate effects, and the specification of the aforementioned elements proper of a 
distributional regression. The following discussion is further extended by Marra et al. (2015), 
to which we refer the reader for more extensive and thoughtful argumentations. All the relevant 


computations presented in the study are performed in the R environment (R Development Core 


Team, 2015) using the package SemiParBIVProbit (Marra and Radice, 2015) which implements 


the ideas discussed in this article for the binary case, and whose corresponding representation 
in the (r, F%, Z) form has been previously given in <©• Notice, however, that because of existing 
restrictions in the original data-set diffusion, just a one generated from it can be used for 
reproducibility purposes, and it can be accessed from the above-mentioned package (hiv); details 
on the employed model specification can be found in McGovern et al. (2015). 


4.1 A Dichotomous Regression Defined Through Bivariate Copulae 

The models presented in Section |2.2| where originally defined through a bivariate Gaussian 
distribution. This may be a strong assumption though, especially in applied disciplines where 
symmetries are unlikely or implausible: a mitigation of these constraints can then be achieved by 
extending the framework to copulae. As a first definition, let F\ j be the marginal distribution 
of the j -th component of Y*, and consider the map Cj : [0,1] J —> [0,1], such that 


Cj(F hl (Y*),..., F hJ {Y*j)) =: Cj{U 1: ...,Uj) 


is the joint cumulative distribution function of (Ui ,..., Uj) . Then Uj is uniformly distributed 
for each j e 77, and Cj is called the J-variate copula of the vector (lj*, 


» 1 J 


which is a 

bona fide multivariate distribution function under the Sklar’s Theorem (Sklar, 1959). Notice 


that, simply by denoting Fj(rj k ) = Cj(F\^(r)\ ikl ),..., Fi } j(r]j jkj )), any copula representation in 
principle belongs already to the class of models we have introduced; for a full account of copulae 
and their properties we refer to the monograph of Nelsen (2006). 
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A bivariate copula regression for dichotomous responses sets the probability of any 717 e 7T, 
for k € {(0,0), (0,1), (1,0), (1,1)} =: /C, as 

^ := p[yi = fc 1 ,r 2 = fc 2 ] 

= (r _1 oC 2 )(i ?1 i,i(?7i )i; i), J Fi,2(r?2,fe 2 );7*) = C 2 (F L i(t] 1M ), F h2 (r, 2M ); 7)> 

where the last equality follows for r being the identity map, and with 7 being an association 
parameter measuring the dependence between the two marginals. For optimisation purposes it 
is sometimes desirable to unbound the support of 7 , hence a specific copula-dependent trans¬ 
formation 7 * may be applied, which is taken here to be a function of the covariate vector X3. 
Since the corresponding (r, F 2 . Z) representation of this model for non-random sample selection 
is given by 

(7r£,C 2 ($(?7i )fel ), $( 772 ,fc 2 ); 7*(*3))» z ) 1 

in the proceeding all the IRLS quantities needed to perform estimation and inference can be 
derived from this, while the binding copula is intentionally left unspecified. 

The specialisation of the model for dichotomous responses simplifies the generic framework 
considerably. In particular, by neglecting any triangular structure (r = L = I3), Dj reduces to 
the 3 x p design matrix X* = (x^j, x 2 j, xs^)" 1 ", and K, = 0 P)P ; whereas the GLM representation 
implies dl t / d0 k = 1 — 717 for any k € K, actually observed, d 2 6 k /d7r k = vr A T 1 (l — 7 T k )~ 2 — vr A ) 2 (l — 
7 T fc ) _1 and d 2 li/d0l = —b"{6 k ) = — 717,(1 - 717 ). Let now r, k = ( 7717 ,7? 2 ,i, 7 *) T be the vector of 
the linear predictors evaluated at the z-th individual, where the subscript for 7 * is introduced 
to remark its dependence on x^^, then we can further decompose 

dF k _ dF k dc k 

d Vk dr, k dF k ( > 

to make explicit the contribution of the marginal distributions to F k . The score and the main 
component of Hessian matrix are then 

1 d 2 T k 1 8F k 8C k 


Y 7 p _ 1 vT 3F k dC k 

v flii - Xj 

7T k dr, k dF k 


and 


Wi = 


dFi 


dC k \ 

7r/c dr, k dril n 2 dr, k dF k \ dr, k dF k ) 


T 


with 


d 2 F k dF k d 2 c k 


dr, k dr, 


J 


dr, k 8F k dFl 


(dF k \ 
V dr lk ) 


T 


+ 


d 2 F k 8C k 
dr,kdril dF k ' 


If we further assume Standard Normal marginals for both the components of Y, (15) specialises 
as 

r 0 


dFk 

dVk 


9vi,i 

0 


0 


0 

d®(ri2,i) 

dil2,i 

0 


0 

d'H 

9F 



r 9Cfc(v;0 1 

r 





9C k (-,-y) 



9®(r)2, i) 



9C k (■,■;■) 



9"/i 



<t>( r t2,i)h 2 ,i 

Ufa 


‘3,i 


where hj.i and d'yi/d'y* are quantities specific to the copula employed. Moreover, d 2 C k /dF k dFj 
is the symmetric matrix with generic element hi tTn , t i = dhi^/d^(r, m ^), l,m = 1,..., 3, under the 
notational abuse r, 3 ^ := 7 j, and 

^T = dia S ( -h 2 ,i(t>(V2,i)m,i 

dr, k dr, k dF k \ a ^i J 

The derivations above make it clear that W,; is a symmetric 3-dimensional matrix whose generic 
element w\ m *, for l,m ^ 3, after some tedious algebra, is given by 


nh k ih m ^ 
77 


4>(m,i) ( t>(‘nm,i ) i,m^ 3 , 


^k n k 

and expressions for l,m = 3 can be obtained in a similar fashion as based on the quantities 
derived above. Finally, the z-th addendum defining the Hessian matrix is simply ~Hi = X^ W ? ;X,;, 
and the pseudo-data vector z,; = X^z? — is 3 x 1, with u,, = n^idFk/drik). 
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4.2 Background and Results 


HIV prevalence in a population is defined as the fraction of people who are infected or, expressed 
equivalently, as the probability that a randomly drawn individual has the disease. Accurate 
estimation of the HIV prevalence is essential to policy makers to develop effective control pro¬ 
grammes and interventions. Only in recent years, however, in countries where the diffusion of 
the virus is generalised epidemic, the lack of available administrative data has been overcome 


by the intensive use of population-based surveys (Boerma et ah, 2003). This is an important 
new source of data: prior to their introduction, national estimates have prevalently relied on 


some number of sentinel antenatal clinics (UNAIDS-World Health Organization, 2007), whose 


data may nonetheless present different sources of bias. First of all, their samples are based only 
on sexually active women who are pregnant and attend a clinic; secondly the location of the 
facilities, mostly concentrated in urban areas, may also induce a bias even in the subpopulation 
of pregnant women. These points have been elucidated and discussed with greater details in 


Montana et al. (2008) and Arpino et al. (2014), among the others 


On the other hand, participation rates for HIV testing in national surveys are generally low, 


and ranges from 72% for men to 77% for women in the 2007 Zambia DHS (Hogan et al. 2012), 


although even lower peaks are recorded in the 2004 Malawi DHS (63% and 70%, respectively). 
There are potentially many reasons inducing this pattern, including concerns, lack of incentive to 


participate, survey fatigue or migration of those targeted for interview (Gersovitz, 2011 Sterck 


2013. McGovern et ah, 2015); missing data on respondents’ HIV status represent therefore a 


not necessarily less severe cause of bias than the ones already mentioned above. This case study 
focuses on refusal to be tested for HIV, which is commonly regarded as the main reason of 
missingness in surveys. 

Notice, however, that the use of imputation or weighting techniques are likely to produce 
biased estimates if the selection mechanism does not occur at random, an assumption violated 
wherever the reasons of the refusal to test are caused by some unobserved factors. This is the 
case, for example, when individuals refuse to screen because they already know (or correctly 
predict) their HIV status, and fear others will learn about their seropositivity if they participate 
in the survey (McGovern et al., 2015). The framework introduced in this article allows us to 


estimate a Heckman-type selection model which is able to account for the possibility that data 
are missing not at random. Specifically, this is achieved by modelling item non-response as a 
function of unobserved variables that also affect the individual HIV status, and by specifying 
the selection mechanism together with an assumption on the distribution of the unobservables. 
To foster the identification of the causal mechanism in the study an exclusion restriction is 
imposed: that is we qualify the dependence of the missing data mechanism on a relevant 
variable independent of both the outcome of interest given the willingness to take the test, and 
the unobservables. This is usually labelled an instrument in econometrics and epidemiology, and 
the interviewer identity is regarded here as a valid instrument to be employed. In fact, previous 
researches, including Barnighausen et al. (2011), Hogan et al. (2012), Janssens et al. (2014) 


and McGovern et al. (2015), have successfully included such a variable in their studies, on the 


grounds that interviewer identity generally predicts consent to be tested, but it is unlikely it 
also affects the actual HIV status. 

A pictorial representation of the effects on the estimates of applying a sample selection 
model is reported in Figure [2j By comparison with the first map, the second one shows im¬ 
mediately how the simple imputation of the values under a random missingness assumption 
may severely underestimate the HIV prevalence in the Zambian provinces. The imputation has 
been conducted by making predictions from the univariate model upon discarding the missing 
values. The third map depicts instead how the association parameter of the employed copula 
varies among the different regions of the country, and it has been constructed by exploiting its 
dependence on the geographical location where the survey took place. 

Figure [3] then reports the smooth function estimates for the treatment and outcome equa- 
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Figure 2: First two panels: HIV prevalence for the male population in nine of the ten Provinces 
of Zambia (Northern, Muchinga, as well as part of Eastern have been merged because of the 
data availability) applying an imputation model not accounting for the possible presence of 
values missing not at random, and the corresponding estimates when a bivariate model is fitted 
instead, respectively. Third panel: the estimated absolute values of the association parameter, 
with range (l,oo), in a Joe copula rotated counterclockwise of 90°. The higher its value, the 
stronger the estimated association between the two equations; that is, the more relevant the 
impact of neglecting unobservables in the estimation of the HIV prevalence. The spatial effects 
are obtained here by specifying appropriately the penalty matrix as described in Section 


2.1 


tions, along with their different degrees of non-linearities and associated point-wise confidence 
intervals, when a Joego copula model is fitted to the Zambia DHS data; the subscript is used 
to denote the corresponding copula’s degrees of rotation. Notice that, compared to a bivariate 
Gaussian, the Joe copula is characterised by having a stronger dependence in one tail of the dis¬ 
tribution, and its choice for our study has been motivated by the implied negative association be¬ 
tween the two marginals, as we would expect to occur wherever persons refuse to be tested on the 
basis of some prior knowledge of their HIV+ status. Other existing competitors allowing for the 
same sign of association include models based on the bivariate Gaussian, Frank, Claytongo ; 27 o> 
Joe 27 o and Gumbelgo ; 27 o copulae, which are all implemented in SemiParBIVProbit and dis¬ 
cussed within a system of equations in Radice et al. (2015). As based on information criteria, 
we found that the Joego is the best fitting to the male population data, hence our decision to 
report only selected estimates obtained from this distribution. 

As a final remark, we shall stress that the assumption of distinct distributions may in 
principle lead to different estimates of the HIV prevalence (although it seems not to be an issue 
in this particular application as reported, for instance, by McGovern et al. 2015), as well as these 
can be impacted by the specific functional form of the covariates employed. To deal with this 
critic, some authors advanced instead the identification of a region (rather than of a singleton) 
of plausible values in which the parameters of interest necessarily lie, given the available data 
and the maintained assumptions. This switch from point to partial identification is discussed in 


general terms in 

Manski ( 

1995 

2003) 

and 

Horowitz and Manski 

(2000 

), and applied to a similar 

HIV context by Arpino et al. 

(201- 

:). Although theoretically valid and appealing, a major 


drawback of this approach is the realistic possibility of obtaining large width of the estimated 
bounds: this in turn may let the communication of any result to policymakers harsh even in 
the case where the identifying region is shrunken by the imposition of a monotone instrumental 
variable. 

Acknowledging this issue, our estimated model extends the traditional Heckman-type by 
accounting for three degrees of flexibility: the inclusion of non-parametric effects in the rep¬ 
resentation of the covariate-response functional form, the specification of bivariate copulae to 
detect more complex dependence structures than classical distributions usually assume, and 
the direct modelling of the association parameters in terms of some predictors. It is our hope, 
in this way, to conjugate both the point and partial identification strengths by providing the 
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Figure 3: Top panel: smooth function estimates and associated 95% point-wise confidence 
intervals in the treatment equation obtained by applying the Joego regression spline model on 
the 2007 Zambia DHS data. The method employed to derive the confidence bands is described 
in Supplementary Material |S.2[ Results are plotted on the scale of the linear predictor and 
the jittered rug plot, at the bottom of each graph, shows the covariate values.The smooth 


components are represented using low-rank penalized thin plate regression splines (Wood 2003) 


with basis dimensions equal to 10 and penalties based on second order derivatives. The numbers 
in brackets in the y-axis captions are the effective degrees of freedom of the smooth curves. 
Bottom panel: estimated smooth functions in the outcome equation. 


researchers with a set of flexible tools aimed at exploring the identifying region widely, and 
so to make better informed judgments about the robustness of their results wherever a point 
estimate is sought. 


5 Discussion 


This paper has devised a generic framework for the representation and estimation of a Gen¬ 
eralized Linear Model for a J-dimensional vector of discrete responses, with a ridge-type pe¬ 
nalisation term employed in the fitted algorithm. The resulting class of models allows us to 
include non-parametric and spatial covariate effects, among the others, as represented through 
the penalty matrix S.\. In this way, a baseline multivariate Generalized Additive Model has 
been effectively extended to encompass different kind of modelling instances within the same 


unifying framework. In fact, by translating the approach of Peyhardi et al. (2014) to the mul 


tivariate case, only the (r, Tj,Z) form and the matrix are formally required to apply the 
proposed estimation algorithm and related inferential results to different models in the class. 

In particular, once the class has been described in some generality, we have introduced a 
number of bivariate models employed in the literature to account for the possible presence of 
residual confounding in observational studies. The proposed representation provided us with 
a flexible machinery able to extend these models in various directions, foremost towards the 
additive semi-paranretric specification of the linear predictors in the spirit of (V)GAMs. This 
is, per se, already a relevant issue in applied research, since it permits to handle a data-driven 
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representation of the covariate-response relationship and hence to alleviate a possible source of 
bias stemming from model mis-specification. Moreover, we have described how the framework 
can be further specified in order to include multivariate distributions as computed by copulae 
of univariate marginals: some analytical results have been derived for Normal marginals within 
a bivariate dichotomous regression model. 

A further feature illustrated by the article has been the direct modelling of any copula 
association parameter in terms of known predictors. As shown in the analysis of non-random 
sample selection for the 2007 DHS Zambia dataset, this characteristic is attractive as it allowed 
to quantify the strength of the unobservables within the different provinces of the country, and 
this in turns enabled us to provide new insights about the severity of the non-response issue 
in the study. In particular, Figure [2] showed that the magnitude of the copula association 
parameter can vary considerably even between geographically close provinces, like Northern 
and Luapula in the example. On this point, the relevant literature has already stressed that 
demographic and environmental factors, like the presence of cities or high density housing, may 
impact the estimates of the HIV prevalence. Hence, the combination of this knowledge with 
the possibility of letting the association parameters depend on observed variables seems to us 
an attractive feature that could be investigated more closely. 

As a natural specification of the proposed framework, the practical implementation of models 
involving ordinal responses are being developed, whereas the estimation of higher dimensional 
systems of equations is still limited by the necessity of computing multivariate integrals with 
a good degree of accuracy. Under this respect, the exploitation of a more comprehensive class 
of models for copula distributions may be beneficial, possibly by allowing the non-parametric 
estimation of the marginals and/or the corresponding copulae. These are only some of the 
possible avenues of future research that will be undertaken. 
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Supplementary Material to 
“Discrete Responses in Bivariate Generalized 

Additive Models” 


S.l Asymptotic Behaviour of the Estimator 


This section provides some arguments about the asymptotic behaviour of the proposed MPL 


estimator. Analogous results were also achieved by Kauermann (2005) in the context of survival 


models, and by Radice et al. (2015) for a bivariate system of dichotomous outcomes. Although 


our derivations are based on the somehow theoretically stringent assumption that the dimen¬ 
sion of the spline bases does not increase with the sample size, this instance is still worth to 
be considered because, in practice, the bases’ dimensions have to be fixed in order to achieve 
estimation. Nonetheless, by taking the number of the bases relatively rich such to appropri¬ 
ately describe the unknown curves in the model, it is possible to assume heuristically that the 
approximation bias is negligible compared to the estimation variability (Kauermann, 2005). To 
the best of our knowledge, at present the relaxation of this assumption has been confined to 
the sole analysis of B-splines for their convenient representation and handling as, for instance, 


did Kauermann et al. (2009). Therefore, despite the theoretical relevance of these results, they 


still do not encompass the whole range of smooths allowed by this work. 

Let i9o be the “true” parameter vector, in the sense that it induces the best approximating 
likelihood corresponding to the structure that has generated the data. Namely, Ro is set the 
minimiser of the Kullback-Leibler discrepancy 

R 0 := argminKL (Ct\C n ) = E[£ t - £ n (R)], 
tfe© 

where the expectation above is carried out with respect to the true model distribution. As a 
consequence, by direct differentiation of the above equation, we are implicitly defining $o to 
be the vector such that 0 = E[V^ 0 ^ n ]. For the proceeding analysis we rely on the following 
regularity conditions: 

(A.l) Vtf 0 4 = O^n 1 / 2 ); 

(A.2) = O(n); 

(A-3) V* 0 *x4 - E[V„ o< 4] = Optn 1 / 2 ); and 
(A.4) = o(n 1 / 2 ). Following 


Kauermann 


(2005), this assumption can be equivalently re¬ 


stated as A j^m-j = o(n 1//2 ) from the very construction of the penalty matrix, and from the fact 
that its dimensionality is taken fixed as n increases. 


The above (A.l) (A.3) are the standard conditions for the consistency of the unpenalized ML 
estimator, whereas |(A.4) ensures that, in the large sample limit, V\ becomes irrelevant for the 
fitting. For a further investigation, we also need an additional condition aimed at describing 
the behaviour of the log-likelihood third derivatives, and it guarantees the asymptotic Normal 
distribution of the score: 

(A.5) for every t) s E $, (<9 3 / d'd s3 )l n (fi)) exists and satisfies for every point x E R and every 
parameter in the neighbourhood of |(<9 3 /cM s3 )4 l (R)| < M{x ), with E[M(x)|i?g] < oo; and 
let 0 < Z(i?5) < oo. 

Then it follows: 


Proposition 2. Let $o be the parameter vector defined as in ( pl| ) and assume that conditions 
(A.l) (A.5) hold; then the MPL estimator ■d := argmax 1?e0 [^(t?) — 1/2$ T S,\#] satisfies 


R - R 0 = F- 1 (A)(V^/(R 0 ) - S A 0 O )[1 + o p (l)], 


(SI) 
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In particular, the leading stochastic component in (SI) 


with F _1 (A) = (S a -E[V^t£(^o)])- 1 . 
has asymptotic order O p (n -1 / 2 ). 

Proof. We first set the notation. Let us denote by IP the j-th component of the parameter 
vector d = (i? 1 ,... ,d p ) T , and define subsequently £ p j := d£ p /dd^ the partial derivative of the 
penalized log-likelihood with respect to d J ; higher order derivatives are denoted subsequently. 
Also, the “hat” notation t p stands for £ p (d), while the convention of omitting the listing of 
parameters is used wherever the relevant quantities are evaluated at the best coefficient do, 
that is £ p := £ p (do). 

Using the Einstein summation convention, we expand £ p ^ r around £ p ^ r using a second order 
Taylor approximation: 


0 — £p,r — t 


e p ,rs(V - #o) S + f P ,rst(# - #o) St + 


with (D—Do) s := d s — dp and (d—do) st = ($— do) s (d— 'do)*- Solving the above equation for d—do, 


and denoting by superscripts the inverses of the respective quantities, we get (Barndorff-Nielsen 


and Cox, 1994): 


(o - d 0 y = -r p %, 8 - + 


(S2) 


where £Zf v := £ rs £ t p£™£ P)S t v , and £ rs is the (r, s)-th element of the inverse observed (penalized) 


2005): £ 


Fisher Information. Equation (S2) can be simplified as follows (see, for example, Kauermann 
p,rs • — frsi. A) T r rs , 


where f rs ( A) := f rs { 0) — s r f( is the penalized expected Fisher 
Information contribution: f rs ( 0) := K[d£/dd r dd s ], and r rs := I rs — f rs ( 0). 

Under assumptions (A.2) and |(A.4)| we find that f rs ( A) is of asymptotic order 0{n ), and 
that r rs = O p (n 1//2 ) directly from (A.3) We can then simplify the first term of (S2) as 


-f p s = E[£ p , r £ p , 


i-i 


+ IE[£p , r £p,t] E[^p,s^p,u. 

- -f rs (X) + f rt (X)f su (X)(- frsW + l 


- 1 / 


(®[^p,^p,u] + £p,tu) 
p,tu)l 


Kauermann et al. 


(2009) we 


that is £p S = f rs ( A) — f rt (X)f su r tu ; following now the argument of 
have 

t r p s = f rs { A)[l + 0{n- l )O p {n 1 / 2 )] = f rs ( A) [1 + O^rT 1 ! 2 )). 

We need to characterise next the order of £ff v , which in turns depend on the one of £ p . s tv First 
note that £ p<s tv = £stv from the very construction of the penalized likelihood estimator, so that 


we can safely apply (A.5), implying that we can bound in probability the third derivative of 


the log-likelihood. Then, by the strong law of large numbers, we have that, for almost every 
sequence of {x \,..., x n } and every d 6 0, 


n 




stv I _ 


< n 


-E 


M{ Xi ) ^ E [M(x)\ 


as n —> oo, hence n £ s tv = O p ( 1). It is then implied l s t v = O p (n) and, after some tedious 


,- 2 \ 


computations, £ r p tv = f rs {X)f tu (X)f vw (X)O p (n ) = O p (n 
£ p ,u = O p (?r 1 / 2 ) — o(n 1 / 2 ). We also find that £ p s £ p , s has order O p (n -1 / 2 ) + o(n -1 / 2 ), that is 


so that £ r p tv £ p ,Jp,w = O p (n x ) since 


the second addendum in (|S2|) becomes asymptotically negligible compared to £p S £ P:S . We can 


then write (d — do) r = —f rs (X)£ PiS [l + o p (l)], whose leading terms, in matrix notation, are 
F - 1 (A)(Vtf 0 £($o) ~ S a $o ) 5 from which the assertion follows. 

The stochastic order of the above terms then stems from f rs (X)£ P>S = O p (n~ 1/ ' 2 )+o p (n~ 1/ ' 2 ) = 
o p (n~ 1//2 ). ’ m 
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The above derivations also allows us to characterise the bias and the variance of the MPL 
estimator, as well as their corresponding asymptotic orders. In particular, we find that 

E[$] — i9 0 = F~ 1 (A)Sai?o[ 1 + o(l)] (S3) 

and 

Var[£] = -F- 1 (A)E[V 1 ?o< ^ 0 )]F- 1 (A)[1 + o(1)], (S4) 

with orders of o(n _1 / 2 ) and 0(n -1 ), respectively. In fact, we immediately obtain the (asymp¬ 
totic) equivalence 

(d-v 0 y = -r(\)t P Ai + o P (i)} 

from which 


and 


E[(* - *>) r ] = -ELT (A)(4 - 4^)][1 + 0(1)] = / rs «[ 1 + o(D] 
Var[?n = (/-(A)) 2 ¥ar[4][l + o(l)] = -(/ rs (A)) 2 /r S (0)[l + o(l)]. 


(S5) 


Finally, invoking (A.2) and (A.4) and since f rs ( A) is 0(n i ), we compute E[(i? — '!?o) r ] = 


0{n 1 )o(?i 1 / 2 ) = o{n ¥ 2 ), while Var[i? r ] is led by terms of order 0(n 2 )0{n) = 0{n 


-U 


S.2 Confidence Intervals Computation 


At convergence of the estimation algorithm, the penalised GLS representation of the model 
induces a covariance matrix of the estimator of the form Vg = which can in principle 

be used to compute the standard errors of each component of i9. An appealing alternative 
approach to conduct inference, however, is to advocate a Bayesian reasoning as based on the 
posterior distribution of $|w 

0|w ~ A r p ([D T WD + S a ]' 1 D t Wz, [D t WD + S A ] _1 ), 


which is equivalent to choose $|v ~ A /),($, V$), with V# := The Bayesian framework 

above emerges naturally from the specification of the model through a roughness penalty ap¬ 
proach. In effect, as Wahba (1983) and Silverman (1985) recognised, the imposition of any 


kind of penalisation in the estimating procedure corresponds to the explication of some kind of 
prior beliefs about the likely features of the true model. Specifically, the definition of a nor¬ 
mal prior for $, /# oc exp{—1/2 i9 t Sa$}, implies that smoother models are more likely than 


wiggly ones, while it gives equal probability density to all models of equal smoothness (Wood 


2006b). The stated posterior distribution is then a consequence of the asymptotic normality 
of w := D t Wz. Upon re-writing w = DWDt? + D T u, it holds that the last addendum is 


aymptotically bounded by a random vector with distribution Af p (0 p , D WD) because of (A.5) 
whereas the first one converges in probability to D t WDi? from which the desired distribution 
follows. 

For the construction of confidence intervals of the non-parametric model components, the 
employment of is usually preferred to Vg. In fact, as argued by Marra and Wood (2012) 
in the context of GAMs, the former can produce intervals with close to nominal “across-the- 
function” frequentist coverage probabilities, as resulting from the inclusion in of both a bias 
and a variance component, a feature which is not shared instead by Vg. A key requirement 
at the basis of the result is that the magnitude of the bias component is substantially of a 
small portion compared to the sampling variability, an occurrence that is guaranteed wherever 
heavily over-smoothing is prevented (Nychka 1988). Point-wise confidence intervals for the 


estimated non-parametric curve are then obtained from Af{sj t i j bj ^ 
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Algorithm Si Approximate (1 — a)% Confidence Interval (CIi Q ) for T(d) 

Require: i?; D; W; S A ; N sim 
Vtf <= (D t WD + S A ) _1 
draw ~ M p (d, V#(d)) 

TO <= {T($*)} r 

T := {T^y . . • , SUC ^ that ^(r) — -^(r+l)’ =) 1 • • • 1 -^sim 

define T* as the smallest [A s j m a]-th value of T* 

cii-« <= K /2 ,r;_„ /2 ] 


where ^ ] is the sub-matrix of corresponding to the parameters associated to the (j, lj )- 
th smooth. 

More generally, confidence intervals for a non-linear function T(d) of the MPLE can be con¬ 
structed by a convenient simulation scheme from the posterior distribution i?|w, as illustrated 
in the pseudo-code given in Algorithm [Si} 

To conclude, the asymptotic equivalence between the frequentist and the Bayesian variance 
estimators can be establised as follows: 


Corollary 1 (to Proposition [2]) . Under assumptions (A.l ) \(A.5) the re-scaled frequentist and 
Bayesian variance estimators \/nVg and y/nV&, respectively, converge to the same quantity 


- v^E-^V^t^o)] 


as n —> oo. 


Proof. Using equation (S5) above, we derive 


Vai#] = (/-(A)) 2 ¥ar[4][l + o(l)] = -(/"(A)) 2 /„(0)[1 + o(l)], 

from which 

\/n¥ar[t? r ] = -y/n(f rs (0 ) - s A s )~ 2 / rs (0)[l + o(l)] 

~ -Vn" 1 (y/n^f rs (0 ) - o(l)) -2 / rs (0) « -Vn(f rs (0)) 2 f rs {0) = -y/n(f rs (0 )) _1 ; 


this corresponds immediately to the statement once written in matrix notation. Similarly, we 
can compute the asymptotic limit of the Bayesian variance estimator: reminding that V$ = 
— analogous arguments as above lead to: 

- Vni r p s = -Vnf rs (X)[ 1 + 0 (1)] « -y/n(f rs (' 0)) _1 
which concludes the proof. ■ 


S.3 Proof of Proposition [T] 

Let us consider first a Taylor expansion of — 2£(d*) about d: 

- 2t(d*) « -21(d) - 2(d* - d) T V#t(d) - (d* - d) T V mT £(d)(d* - d), 

and recall that, in the large sample approximation, —= D T WD —D T WD, hence 
we can write the addenda in the above expression as: 

(d* - tf) T D T WD (d*-d) = ||VW(z -DiT)|| 2 + ||VW 1 u\\ 2 - 2 (z - Dd* ; u) 
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and 

(0* - 0) T D T u = || VW u|| 2 - (z - Dtf*;u) . 

Then we have 

- 21(0*) « -2^(0) - || VW u|| 2 + ||VW(z - DiT)|| 2 

and, by noticing that the smoothing parameter vector affects the latter approximation only 
through the updated iteration 0* , and that we are interested in optimising a criterion with 
respect to A, it is lecit to drop any addendum not depending on it. Hence, one can indifferently 
consider an equivalent UBRE criterion given by 

V u oc || \/W(z - Di?*) || 2 + 2tr(P) = 2tr(P) - 


S.4 Data Generating Process Employed in Figure [T] 

The simulation results depicted in Figure [l] comprises a bivariate system of equations specified 
by the following Data Generating Process (DGP): 



X M + 2x2,1 + x 3) , + si,i(v M ) + Si, 2 (v 2 ,i) + ei,i 
-0.3+ xpj - 2x 2)i + S 2 ,i(vi,i) + e 2 ,i 



for a sample size of 10,000 observations, and N = 100 replications. The test functions are 
displayed in red in the figure, and given by si,i(vi,j) = 1 — vi,j + 1.6v^ i — sin(5vi,i), si, 2 (v 2> j) = 
4v 2 ,j and s 2j i(vi i j) = 0.08{vJ 1 J10(l — vi,j)] 6 + 10(10vi,j) 3 (l — vi,*) 10 }. Furthermore, the ordered 
values of y ]:l have been computed following the observation rule: 


Vj,i 


^ ] kj'&Cjj-i<y ?, 
kj G/C j 


for every j E {1,2}, and obtained by setting the threshold parameters at Ci := (—2, — 1,0, 2) T 
and e 2 := (-1.4, -0.7, -0.2, 0.7,3) T . 
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